From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings
Abstract
A core part of linguistic typology is the classification of languages according to linguistic properties, such as those detailed in the World Atlas of Language Structures (WALS). Doing this manually is prohibitively time-consuming, as is partly evidenced by the fact that only 100 of the more than 7,000 languages spoken in the world are fully covered in WALS. We learn distributed language representations, which can be used to predict typological properties on a massively multilingual scale. Additionally, quantitative and qualitative analyses of these language embeddings can tell us how language similarities are encoded in NLP models for tasks at different typological levels. The representations are learned in an unsupervised manner alongside tasks at three typological levels: phonology (grapheme-to-phoneme prediction and phoneme reconstruction), morphology (morphological inflection), and syntax (part-of-speech tagging). We consider more than 800 languages and find significant differences in the language representations encoded, depending on the target task. For instance, although Norwegian Bokmål and Danish are typologically close to one another, they are phonologically distant, which is reflected in their language embeddings growing relatively distant in a phonological task. We are also able to predict typological features in WALS with high accuracy, even for unseen language families.
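To make the idea of predicting typological properties from language embeddings concrete, here is a minimal sketch. It assumes a nearest-neighbour classifier with cosine similarity, which the abstract does not specify. The vectors, the language names (`lang_a` to `lang_e`), the word-order labels, and the function `predict_feature` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_feature(query, embeddings, labels, k=3):
    """Majority vote over the k labelled languages closest to `query`.

    Assumption: a k-nearest-neighbour classifier; the abstract does not
    say which classifier is used.
    """
    ranked = sorted(
        labels,
        key=lambda lang: cosine(query, embeddings[lang]),
        reverse=True,
    )
    votes = Counter(labels[lang] for lang in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up embeddings and feature values (not real data).
embeddings = {
    "lang_a": [0.9, 0.1, 0.0],
    "lang_b": [0.8, 0.2, 0.1],
    "lang_c": [0.1, 0.9, 0.2],
    "lang_d": [0.0, 0.8, 0.3],
    "lang_e": [0.7, 0.3, 0.0],
}
labels = {
    "lang_a": "SVO",
    "lang_b": "SVO",
    "lang_c": "SOV",
    "lang_d": "SOV",
    "lang_e": "SVO",
}

# Embedding of a language whose feature value is unknown.
query = [0.85, 0.15, 0.05]
prediction = predict_feature(query, embeddings, labels, k=3)
print(prediction)
```

In this toy setup the query vector lies closest to `lang_a`, `lang_b`, and `lang_e`, so the vote returns their shared label. With real embeddings, the labelled languages would be those covered for a given WALS feature and the query would be a language that lacks that feature value.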
Journal: CoRR
Volume: abs/1802.09375
Publication year: 2018